Basis Functions and Generalized Additive Models
University of Zurich
Mid-Atlantic wage data from ISLR.
There appears to be a curvilinear relationship.
If we miss it, we under-fit the data and introduce bias.
But how do we build non-linear relationships into a “linear” regression model?
We use basis functions in the pre-processing stage.
Basis Functions as Transformations
Consider the predictive feature \(x\). We transform this feature using the basis function \(B_j(x)\), which is specified ahead of the learning process. This function can be used to introduce curvature into a model.
A basis set is a collection of different basis functions, each applied to \(x\), whose weighted sum approximates a complex function:
\[ y_i = \beta + \omega_1 B_1(x_i) + \omega_2 B_2(x_i) + \cdots + \omega_K B_K(x_i) + \varepsilon_i \qquad(1)\]
Basis Function
For an integer \(j\),
\[ B_j(x) = x^j \qquad(2)\]
The resulting regression model takes the form of
\[ y_i = \beta + \sum_{j=1}^K \omega_j x_i^j + \varepsilon_i \]
where \(K\) is set by the researcher.
Raw polynomials take the form \(x^j\): they simply transform the raw feature scores.
Because raw polynomial terms are highly correlated with one another, they can induce multicollinearity, so orthogonal polynomials are preferred.
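A minimal base-R sketch of the difference (illustrative simulated data, not the wage analysis): raw polynomial columns are nearly collinear, while `poly()` produces orthogonal, uncorrelated columns.

```r
x <- seq(18, 80, length.out = 200)   # an age-like feature

raw  <- cbind(x, x^2, x^3)           # raw polynomial basis
orth <- poly(x, degree = 3)          # orthogonal polynomial basis

cor(raw)[1, 2]    # close to 1: x and x^2 are almost collinear
cor(orth)[1, 2]   # close to 0: orthogonal columns are uncorrelated
```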
The polynomial is applied to the whole of the age range.
What if we believe that different polynomials apply to different ranges?
For instance, we divide age into two regions such as \(\text{age} < 50\) and \(\text{age} \geq 50\).
We can now apply two different polynomials, for instance,
\[ \begin{split} y_i|\text{age} < 50 &\approx \beta_1 + \omega_{11} \text{age}_i + \omega_{12} \text{age}_i^2 \\ y_i|\text{age} \geq 50 &\approx \beta_2 + \omega_{21} \text{age}_i + \omega_{22} \text{age}_i^2 \end{split} \]
The problem is that this setup creates a jump in the fitted function at age = 50.
Moreover, the derivatives of the fitted function are also discontinuous at the cutoff of 50.
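A quick sketch of the problem on simulated data (variable names and the signal are ours): fitting the two quadratics separately leaves the fitted curve discontinuous at the cutoff.

```r
set.seed(1)
age  <- runif(500, 18, 80)
wage <- 60 + 40 * sin(age / 25) + rnorm(500, sd = 5)   # some curvilinear signal
d    <- data.frame(age, wage)

lo <- lm(wage ~ age + I(age^2), data = d, subset = age < 50)
hi <- lm(wage ~ age + I(age^2), data = d, subset = age >= 50)

# The two fits generally disagree at the cutoff: a jump at age = 50.
predict(lo, newdata = data.frame(age = 50))
predict(hi, newdata = data.frame(age = 50))
```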
Imagine we impose \(k\) cutoff points or knots that divide a predictive feature into \(k+1\) regions.
A polynomial spline or \(B\)-spline of degree \(d\) is a piece-wise polynomial such that, at each knot, we impose continuity on the function and its derivatives up to order \(d-1\).
Instead of tuning \(d\) and \(k\) separately, we often tune the degrees of freedom.
For \(B\)-splines, \(df = d + k\).
The default is typically the cubic spline, meaning that \(df = 3 + k\).
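With `splines::bs()` the degrees of freedom directly set the size of the basis; for the cubic default, requesting \(df\) implies \(df - 3\) interior knots. A sketch on simulated data:

```r
library(splines)

set.seed(1)
age <- runif(500, 18, 80)

B <- bs(age, df = 7)   # cubic B-spline basis: d = 3, so k = 4 interior knots
dim(B)                 # 500 rows, 7 basis columns
attr(B, "knots")       # the 4 interior knots, placed at quantiles of age
```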
We typically set \(df\) through re-sampling.
We see a lot of variance at the ends of the cubic spline.
This has to do with a lack of constraint at the end points.
We can reduce this variance through the use of natural splines.
Here the functional form becomes linear below the smallest knot and above the largest knot.
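`splines::ns()` implements natural cubic splines, and the linearity beyond the boundary knots is easy to verify numerically (a sketch with simulated data; names are ours):

```r
library(splines)

set.seed(1)
age  <- runif(500, 18, 80)
wage <- 60 + 40 * sin(age / 25) + rnorm(500, sd = 5)

fit <- lm(wage ~ ns(age, df = 4))

# Natural splines are linear beyond the boundary knots, so extrapolated
# predictions fall on a straight line:
p <- predict(fit, newdata = data.frame(age = c(90, 95, 100)))
diff(p)   # the two successive differences are (numerically) equal
```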
We can combine various functions for different predictive features into an additive model:
\[ y_i \approx \beta + \sum_{j=1}^P f_j(x_{ij}) \] where \(f_j(.)\) is a spline with its own set of weights.
This is known as a generalized additive model or GAM (Hastie & Tibshirani, 1987).
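One way to fit such a model directly (outside tidymodels) is with mgcv, where each `s()` term gets its own penalized spline basis. A sketch assuming the ISLR `Wage` data is loaded; the basis size `k = 5` for year is our choice, since year takes only a handful of distinct values:

```r
library(mgcv)
library(ISLR)   # provides the Wage data

# An additive model: separate smooths for age and year, dummies for education
gam_fit <- gam(wage ~ s(age) + s(year, k = 5) + education, data = Wage)
summary(gam_fit)
```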
The objective is to fit a polynomial regression of degree 3 to the age predictor in a wage model.
# The recipe is not shown above; the output below implies an orthogonal
# polynomial of degree 3 applied to age:
model_recipe <- recipe(wage ~ age, data = training_set) %>%
step_poly(age, degree = 3)
poly_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
model_flow <- workflow() %>%
add_model(poly_spec) %>%
add_recipe(model_recipe)
wage_fit <- fit(model_flow, data = training_set)
tidy(wage_fit)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 112. 0.849 131. 0
2 age_poly_1 403. 40.3 10.0 4.43e-23
3 age_poly_2 -410. 40.3 -10.2 8.51e-24
4 age_poly_3 111. 40.3 2.76 5.74e- 3
Let us train a linear model of wage with age, education, and year of data collection as predictive features.
We create dummy variables for education.
We use natural splines for age and year.
We tune the degrees of freedom for those splines.
wage_recipe <- recipe(wage ~ age + education + year,
data = training_set) %>%
step_normalize(age, year) %>%
step_dummy(education) %>%
step_ns(age, deg_free = tune("age")) %>%
step_ns(year, deg_free = tune("year"))
extract_parameter_set_dials(wage_recipe)
Collection of 2 parameters for tuning
identifier type object
age deg_free nparam[+]
year deg_free nparam[+]
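The objects `first_flow` and `ns_tune` referenced below come from the tuning step, presumably built along these lines (a sketch; the fold count, grid size, and seed are our assumptions):

```r
library(tidymodels)

first_flow <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_recipe(wage_recipe)

set.seed(1)
folds <- vfold_cv(training_set, v = 10)

ns_tune <- first_flow %>%
  tune_grid(resamples = folds,
            grid = grid_regular(extract_parameter_set_dials(wage_recipe),
                                levels = 5))
```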
final_flow <- first_flow %>%
finalize_workflow(select_best(ns_tune))
final_est <- final_flow %>%
fit(training_set)
tidy(final_est)
# A tibble: 11 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 45.8 5.33 8.58 1.71e-17
2 education_X2..HS.Grad 11.1 2.86 3.90 9.91e- 5
3 education_X3..Some.College 24.4 3.01 8.09 9.49e-16
4 education_X4..College.Grad 37.2 3.00 12.4 3.25e-34
5 education_X5..Advanced.Degree 62.4 3.24 19.2 2.10e-76
6 age_ns_1 47.2 4.85 9.74 5.79e-22
7 age_ns_2 38.2 5.88 6.50 1.00e-10
8 age_ns_3 34.6 5.11 6.77 1.61e-11
9 age_ns_4 51.0 12.2 4.18 3.03e- 5
10 age_ns_5 10.9 9.57 1.14 2.55e- 1
11 year_ns_1 12.9 2.78 4.63 3.96e- 6